The Construction of a Chinese-English Patent Parallel Corpus

Authors

  • Bin Lu
  • Benjamin K. Tsou
  • Jingbo Zhu
  • Tao Jiang
  • Yee Kwong
Abstract

In this paper, we describe the construction of a parallel Chinese-English patent sentence corpus which is created from noisy parallel patents. First, we use a publicly available sentence aligner to find parallel sentence candidates in the noisy parallel data. Then we compare and evaluate three individual measures and different ensemble techniques to sort the parallel sentence candidates according to the confidence score and filter out those with low scores as the noisy data. The experiment shows that the combination of measures outperforms the individual measures, and that filtering out low-quality sentence pairs is readily justified as it can improve SMT performance. Finally, we arrive at the final corpus consisting of 160K sentence pairs in which about 90% are correct or partially correct alignments.
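
The scoring-and-filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the three measures or the ensemble techniques, so the measure names (`m1`, `m2`, `m3`), the min-max normalisation, the equal-weight averaging, and the threshold value are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (Chinese sentence, English sentence)


def min_max(values: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ensemble_scores(measures: Dict[str, List[float]]) -> List[float]:
    """Average the normalised score of every measure for each candidate.

    Equal-weight averaging is an assumed ensemble technique.
    """
    normalised = [min_max(scores) for scores in measures.values()]
    n_candidates = len(normalised[0])
    return [
        sum(column[i] for column in normalised) / len(normalised)
        for i in range(n_candidates)
    ]


def filter_pairs(
    pairs: List[Pair],
    measures: Dict[str, List[float]],
    threshold: float,
) -> List[Tuple[Pair, float]]:
    """Sort candidates by ensemble score and drop those below the threshold."""
    scores = ensemble_scores(measures)
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    return [(pair, score) for pair, score in ranked if score >= threshold]


if __name__ == "__main__":
    # Hypothetical candidates and scores, for illustration only.
    candidates = [("zh-1", "en-1"), ("zh-2", "en-2"), ("zh-3", "en-3")]
    hypothetical_measures = {
        "m1": [0.9, 0.5, 0.1],
        "m2": [10.0, 6.0, 2.0],
        "m3": [0.8, 0.8, 0.0],
    }
    for pair, score in filter_pairs(candidates, hypothetical_measures, 0.5):
        print(pair, round(score, 3))
```

Normalising each measure before averaging keeps a measure with a larger numeric range from dominating the combined score; a weighted or learned combination would be an equally plausible choice.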

Similar resources

Chinese-English Parallel Corpus Construction and its Application

Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, and Chinese-English language research and teaching. But so far no large-scale Chinese-English corpus is available, given the difficulties and the intensive labour required. In this paper, our work towards building a large-scale Chinese-Engli...

Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction

This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually trans...

COPPA V2.0: Corpus Of Parallel Patent Applications Building Large Parallel Corpora with GNU Make

WIPO seeks to help users and researchers to overcome the language barrier when searching patents published in different languages. Having collected a big multilingual corpus of translated patent applications, WIPO decided to share this corpus in a product called COPPA (Corpus Of Parallel Patent Applications) to stimulate research in Machine Translation and in language tools for patent texts. A ...

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been carried out on clustering English documents. The question now arises: do those results extend to other languages? Since the goal of document clustering is grouping of docum...

A Japanese-English Patent Parallel Corpus

We describe a Japanese-English patent parallel corpus created from the Japanese and US patent data provided for the NTCIR-6 patent retrieval task. The corpus contains about 2 million sentence pairs that were aligned automatically. This is the largest Japanese-English parallel corpus, which will be available to the public after the 7th NTCIR workshop meeting. We estimated that about 97% of the s...

Publication date: 2009